How to Intelligently Distribute Training Data to Multiple Compute Nodes: Distributed Machine Learning via Submodular Partitioning

Authors

  • Kai Wei
  • Rishabh Iyer
  • Shengjie Wang
  • Wenruo Bai
  • Jeff Bilmes
Abstract

In this paper we investigate the problem of partitioning training data for parallel learning of statistical models. Motivated by [10], we use submodular functions to model the utility of data subsets for training machine learning classifiers and formulate the problem mathematically as submodular partitioning. We introduce a simple and scalable greedy algorithm that near-optimally solves the submodular partitioning problem. We empirically demonstrate the efficacy of the proposed algorithm in obtaining data partitions for distributed optimization of convex and deep neural network objectives. Empirical evidence suggests that the intelligent data partitioning produced by the proposed framework leads to faster convergence in the case of distributed convex optimization, and to better resulting models in the case of parallel neural network training.
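The abstract does not spell out the objective or the greedy rule, so the following is only an illustrative sketch under stated assumptions, not the authors' algorithm. It assumes a max-min flavour of submodular partitioning (make the worst block as useful as possible), uses a facility-location function as a stand-in utility, and applies a simple heuristic: repeatedly give the currently weakest block the remaining item with the largest marginal gain. The names `facility_location`, `greedy_partition`, and `SIM` are hypothetical.

```python
from typing import Callable, List, Sequence, Set


def facility_location(sim: Sequence[Sequence[float]]) -> Callable[[Set[int]], float]:
    """Build f(A) = sum_i max_{j in A} sim[i][j], a monotone submodular utility.

    Assumption: sim[i][j] >= 0 measures how well item j represents item i.
    """
    n = len(sim)

    def f(subset: Set[int]) -> float:
        if not subset:
            return 0.0
        return sum(max(sim[i][j] for j in subset) for i in range(n))

    return f


def greedy_partition(
    n_items: int, n_blocks: int, f: Callable[[Set[int]], float]
) -> List[Set[int]]:
    """Assign every item to exactly one block with a weakest-block-first greedy rule."""
    blocks: List[Set[int]] = [set() for _ in range(n_blocks)]
    values = [0.0] * n_blocks
    remaining = set(range(n_items))
    while remaining:
        # Block with the lowest current utility (ties go to the lowest index).
        b = min(range(n_blocks), key=lambda k: values[k])
        # Remaining item with the largest marginal gain for that block
        # (ties go to the lowest item index because candidates are sorted).
        best = max(sorted(remaining), key=lambda j: f(blocks[b] | {j}) - values[b])
        blocks[b].add(best)
        values[b] = f(blocks[b])
        remaining.remove(best)
    return blocks


# Toy data: items 0 and 1 are near-duplicates, as are items 2 and 3.
SIM = [
    [1.0, 0.75, 0.25, 0.25],
    [0.75, 1.0, 0.25, 0.25],
    [0.25, 0.25, 1.0, 0.75],
    [0.25, 0.25, 0.75, 1.0],
]
f = facility_location(SIM)
blocks = greedy_partition(len(SIM), 2, f)
print(blocks, [f(b) for b in blocks])
```

On this toy input each block ends up with one item from each near-duplicate pair, so every compute node would see a representative slice of the data; a contiguous split such as `{0, 1}` and `{2, 3}` scores lower under the same utility. Each step re-evaluates `f` from scratch, which is fine for a sketch but would need incremental gain computation at real dataset sizes.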

Similar resources

Mixed Robust/Average Submodular Partitioning: Fast Algorithms, Guarantees, and Applications

We study two mixed robust/average-case submodular partitioning problems that we collectively call Submodular Partitioning. These problems generalize both purely robust instances of the problem (namely max-min submodular fair allocation (SFA) Golovin (2005) and min-max submodular load balancing (SLB) Svitkina and Fleischer (2008)) and also generalize average-case instances (that is the submodula...

Graph Partitioning via Parallel Submodular Approximation to Accelerate Distributed Machine Learning

Distributed computing excels at processing large scale data, but the communication cost for synchronizing the shared parameters may slow down the overall performance. Fortunately, the interactions between parameter and data in many problems are sparse, which admits efficient partition in order to reduce the communication overhead. In this paper, we formulate data placement as a graph partitioni...

Some Submodular Data-Poisoning Attacks on Machine Learners

The security community has long recognized the threats of data-poisoning attacks (a.k.a. causative attacks) on machine learning systems [1–6, 9, 10, 12, 16], where an attacker modifies the training data, so that the learning algorithm arrives at a “wrong” model that is useful to the attacker. To quantify the capacity and limits of such attacks, we need to know first how the attacker may modify ...

ENERGY AWARE DISTRIBUTED PARTITIONING DETECTION AND CONNECTIVITY RESTORATION ALGORITHM IN WIRELESS SENSOR NETWORKS

Mobile sensor networks rely heavily on inter-sensor connectivity for collection of data. Nodes in these networks monitor different regions of an area of interest and collectively present a global overview of some monitored activities or phenomena. A failure of a sensor leads to loss of connectivity and may cause partitioning of the network into disjoint segments. A number of approaches have be...

Poseidon: A System Architecture for Efficient GPU-based Deep Learning on Multiple Machines

Deep learning models, which learn high-level feature representations from raw data, have become popular for machine learning and artificial intelligence tasks that involve images, audio, and other forms of complex data. A number of software “frameworks” have been developed to expedite the process of designing and training deep neural networks, such as Caffe [11], Torch [4], and Theano [1]. Curr...

Publication date: 2015